Language training

ABSTRACT

A speech synthesizer (3) produces prompts in the voice of a native speaker of a language to be learned to which the student replies or imitates. A phrase recognizer (1) employs keyword recognition to generate from the student's prompted response an original speech template spoken by the student. Thereafter, interactive dialogue takes place. The student's progress in that dialogue is monitored by measuring the deviation of the student's current speech from his original speech template. When this deviation is sufficiently large so that the recognizer (1) no longer recognizes what the student is saying, the system retrains and updates the student's speech template. In another embodiment, the system includes a display which shows the native speaker's mouth shape while the words to be imitated are spoken by the speech synthesizer (3). Also provided are a video pick-up and analyzer for analyzing the shapes of the student's mouth to give the student visual feedback.

This invention relates to apparatus and methods for training pronunciation; particularly, but not exclusively, for training the pronunciation of second or foreign languages.

One type of system used to automatically translate speech between different foreign languages is described in our European published patent application number 0262938A. This equipment employs speech recognition to recognise words in the speaker's utterance, pattern matching techniques to extract meaning from the utterance and speech coding to produce speech in the foreign tongue.

This invention uses similar technology, but is configured in a different way and for a new purpose, that of training a user to speak a foreign language.

This invention uses speech recognition not only to recognise the words being spoken but also to test the consistency of the pronunciation. Novice students of a language, although able to imitate a pronunciation, are liable to forget it, and will remain uncorrected until they are checked by an expert. A machine which was able to detect mispronunciation as well as translation inaccuracies would enable students to reach a relatively high degree of proficiency before requiring the assistance of a conventional language teacher to progress further. Indeed, very high levels of linguistic skill are probably not required in the vast majority of communication tasks, such as making short trips abroad or using the telephone, and computer aided language training by itself may be sufficient in these cases.

Conventional methods either involve expensive skilled human teachers, or the use of passive recordings of foreign speech which do not test the quality of the student's pronunciation.

Some automated systems provide a visual display of a representation of the student's speech, and the student is expected to modify his pronunciation until this display matches a standard. This technique suffers from the disadvantage that users must spend a great deal of time experimenting and understanding how their speech relates to the visual representation.

Another approach (described for example in Revue de Physique Appliquée vol 18 no. 9 Sept 1983 pp 595-610, M. T. Janot-Giorgetti et al, "Utilisation d'un système de reconnaissance de la parole comme aide à l'acquisition orale d'une langue étrangère") employs speaker independent recognition to match spoken utterances against standard templates. A score is reported to the student indicating how well his pronunciation matches the ideal. However, until speaker independent recognition technology is perfected, certain features of the speaker's voice, such as pitch, can affect the matching scores, and yet have no relevant connection with the quality of pronunciation. A student may therefore be encouraged to raise the pitch of his voice to improve his score, and yet fail to correct an important mispronunciation.

Furthermore, current speaker independent recognition technology is unable to handle more than a small vocabulary of words without producing a very high error rate. This means that training systems based on this technology are unable to process and interpret longer phrases and sentences.

A method of training pronunciation for deaf speakers is described in Proceedings ICASSP 87 vol 1 pp 372-375 D. Kewley-Port et al `Speaker-dependent Recognition as the Basis for a Speech Training Aid`. In this method, a clinician selects the best pronounced utterances of a speaker and these are converted into templates. The accuracy of the speaker's subsequent pronunciation is indicated as a function of its closeness to the templates (the closer the better). This system has two disadvantages: firstly, it relies upon human intervention by the clinician, and secondly the speaker cannot improve his pronunciation over his previous best utterances but only attempt to equal them.

According to the invention there is provided apparatus for pronunciation training comprising:

speech generation means for generating utterances; and

speech recognition means arranged to recognise, in a trainee's utterances, the words from a predetermined selected set of words,

wherein the speech recognition means is arranged to employ speaker-dependent recognition, by comparing the trainee's utterance with templates for each word of the set, and the apparatus is arranged initially to generate the templates by prompting the trainee to utter each word of the set and forming the templates from such utterances, the apparatus being further arranged to indicate improvements in pronunciation with increases in the deviation of the trainee's subsequent utterances from the templates.

Some non-limitative examples of embodiments of the invention will now be described with reference to the drawings, in which:

FIG. 1 illustrates stages in a method of language training according to one aspect of the invention;

FIG. 2 illustrates schematically apparatus suitable for performing one aspect of the invention;

FIG. 3 illustrates a display in an apparatus for language training according to another aspect of the invention.

Referring to FIGS. 1 and 2, upon first using the system illustrated, the student is asked by the system (using either a screen and keyboard or a conventional speech synthesiser and speaker independent recogniser) which language he wishes to study, and which subject area (e.g. operating the telephone or booking hotels) he requires. The student then has to carry out a training procedure so that the speaker dependent speech recogniser 1 can recognise his voice. To this end, the student is prompted in the foreign language by a speech generator 3 employing a pre-recorded native speaker's voice to recite a set of keywords relevant to his subject area. At the same time, the source language translation of each word is displayed, giving the student the opportunity to learn the vocabulary. This process, in effect, serves as a passive learning stage during which the student can practise his pronunciation, and can repeat words as often as he likes until he is satisfied that he has imitated the prompt as accurately as he believes he can.

A control unit 2 controls the sequence of prompts and responses. Conveniently, the control unit may be a personal computer (for example, the IBM PC).

The student's utterances of the keywords are now used as, or to generate, the first set of templates stored in template store 1a to be used by the speech recogniser 1 to process the student's voice. The templates represent the student's first attempt to imitate the perfect pronunciation of the recorded native speaker.
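The enrolment step can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the apparatus itself: utterances are taken to be fixed-length feature vectors, a template is the plain average of a word's enrolment utterances, and matching uses Euclidean distance. The names `build_templates` and `recognise` are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_templates(enrolment):
    """Average each word's enrolment utterances into one template.

    `enrolment` maps a keyword to a list of feature vectors spoken by
    the student (assumed representation, not taken from the text).
    """
    templates = {}
    for word, utterances in enrolment.items():
        n = len(utterances)
        templates[word] = [sum(column) / n for column in zip(*utterances)]
    return templates


def recognise(utterance, templates):
    """Return (word, distance) for the template nearest the utterance."""
    scored = ((word, distance(utterance, t)) for word, t in templates.items())
    return min(scored, key=lambda pair: pair[1])


enrolment = {
    "bonjour": [[1.0, 2.0], [3.0, 2.0]],
    "merci": [[10.0, 10.0]],
}
templates = build_templates(enrolment)
best_word, best_distance = recognise([2.0, 3.0], templates)
```

Because the templates come from the student's own voice, the distance reflects changes in his pronunciation rather than differences of pitch between him and the native speaker.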

The second stage of the training process simply tests the ability of the student to remember the translations and pronunciations of the key word vocabulary. He is prompted in his source language (either visually, on screen 4, or verbally by speech generator 3) to pronounce translations of the keywords he has practised in the previous stage. After each word is uttered, the speech generator 3 repeats the foreign word recognised by the recogniser 1 back to the student and displays the source language equivalent. Incorrect translations are noted for re-prompting later in the training cycle. The student is able to repeat words as often as he wishes, either to refine his pronunciation or to correct a machine misrecognition. If the recogniser 1 consistently (more than, say, 5 times) misrecognises a foreign word, either because of a low distance score or because two words are recognised with approximately equal distances, the student will be asked to recite this word again (preferably several times), following a native speaker prompt from the generator 3, so that a new speech recogniser template can be produced to replace the original template in store 1a. Such action in fact indicates that the student has changed his pronunciation after having heard the prompt several more times, and is converging on a more accurate imitation of the native speaker. This method has the advantage over the prior art that the trainee's progress is measured by his deviation from his original (and/or updated) template, rather than by his convergence on the native speaker's template, thus eliminating problems due to pitch, or other, differences between the two voices. Once the student is satisfied that he has mastered the key word vocabulary, he may move to the third training stage.
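The retraining rule in this stage can be sketched as a simple counter. This is only an illustration under stated assumptions: the recogniser is assumed to return a distance per vocabulary word, a poor match is assumed to mean a best distance above `max_distance`, and the `min_margin` and `max_distance` values are hypothetical. Only the "more than 5 times" figure comes from the text.

```python
class TemplateMonitor:
    """Counts misrecognitions per word and signals when to retrain."""

    def __init__(self, max_failures=5, min_margin=0.1, max_distance=1.0):
        self.max_failures = max_failures  # "more than, say, 5 times"
        self.min_margin = min_margin      # assumed ambiguity threshold
        self.max_distance = max_distance  # assumed poor-match threshold
        self.failures = {}

    def is_misrecognition(self, expected, scores):
        """`scores` maps each vocabulary word to its template distance."""
        ranked = sorted(scores.items(), key=lambda pair: pair[1])
        best_word, best = ranked[0]
        if best_word != expected or best > self.max_distance:
            return True
        # Two words recognised with approximately equal distances.
        return len(ranked) > 1 and ranked[1][1] - best < self.min_margin

    def record(self, expected, scores):
        """Log one attempt; return True once retraining is due."""
        if self.is_misrecognition(expected, scores):
            self.failures[expected] = self.failures.get(expected, 0) + 1
        return self.failures.get(expected, 0) > self.max_failures

    def reset(self, word):
        """Call after a new template has replaced the old one."""
        self.failures.pop(word, None)


monitor = TemplateMonitor()
ambiguous = {"gare": 0.50, "guerre": 0.52}
results = [monitor.record("gare", ambiguous) for _ in range(6)]
```

Here the sixth ambiguous attempt is the first to request retraining. A triggered retraining is read as progress, since it means the student's pronunciation has moved away from his earlier template.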

The student is now prompted in his own language (either visually on screen 4 or verbally through generator 3) and may be asked to carry out verbal translations of words or complete phrases relevant to his subject area of interest. Alternatively, these prompts may take the form of a dialogue in the foreign language to which the student must respond. One useful method of prompting is a `storyboard` exercise using a screen display of a piece of text, with several words missing, which the student is prompted to complete by uttering what he believes are the missing words. The system now preferably operates in the same manner as the phrase-based language translation system (European Published Application No 0262938) and recognises the pre-trained keywords in order to identify the phrase being uttered. The system then enunciates the correct response/translation back to the student in a native speaker's voice, and gives the student an opportunity to repeat his translation if it was incorrect, if he was not happy with the pronunciation, or if the recogniser 1 was unable to identify the correct foreign phrase. In the event that the student is unable to decide whether the recogniser 1 has assimilated his intended meaning, the source language version of the recognised foreign phrase can be displayed at the same time. Incorrectly translated phrases are re-presented (visually or verbally) to the student later in the training cycle for a further translation attempt.

If the recogniser 1 repeatedly fails to identify the correct phrase because of poor key word recognition and drifting student pronunciation, the student will be asked to recite each key word present in the correct translation for separate recognition. If one or more of these keywords is consistently misrecognised, new templates are generated as discussed above.

Phrases are presented to the student for translation in an order which is related to their frequency of use in the domain of interest. The system preferably enables the trainee to suspend training at any point and resume at a later time, so that he is able to progress as rapidly or as slowly as he wishes.

The preferred type of phrase recognition (described in European Published Application No 0262938 and `Machine Translation of Speech` Stentiford & Steer, British Telecom Technology Journal Vol 6 No. 2 April '88 pp 116-123) requires that phrases with variable parameters in them such as dates, times, places or other sub-phrases, should be treated in a hierarchical manner. The form of the phrase is first identified using a general set of keywords. Once this is done, the type of parameter present in the phrase can be deduced and a special set of keywords applied to identify the parameter contents. Parameters could be nested within other parameters. As a simple example, a parameter might refer to a major city in which case the special keywords would consist of just these cities. During student training, translation errors in parameter contents can also be treated hierarchically. If the system has identified the correct form of phrase spoken by the student, but has produced an incorrect parameter translation, the student can then be coached to produce the correct translation of the parameter in isolation, without having to return to the complete phrase.

Parameters are normally selected in a domain of discourse because of their occurrence across a wide range of phrases. It is natural therefore that the student should receive specific training on these items if he appears to have problems with them.
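The two-level lookup described above might be sketched as follows. The phrase forms, keyword sets and the function name `identify` are all hypothetical examples, and the sketch assumes the recogniser has already produced a list of spotted words; it is not the method of the cited application.

```python
# General keywords identify the form of the phrase; each form names the
# type of parameter it carries, if any (illustrative data only).
PHRASE_FORMS = {
    "book_ticket": {"keywords": {"ticket", "to"}, "parameter": "city"},
    "ask_time": {"keywords": {"what", "time"}, "parameter": None},
}

# Special keyword sets, one per parameter type.
PARAMETER_KEYWORDS = {"city": {"paris", "london", "rome"}}


def identify(words):
    """Return (form, parameter_type, parameter_value) for spotted words."""
    spotted = set(words)
    # Stage 1: choose the form sharing the most general keywords.
    form = max(
        PHRASE_FORMS,
        key=lambda name: len(PHRASE_FORMS[name]["keywords"] & spotted),
    )
    parameter_type = PHRASE_FORMS[form]["parameter"]
    value = None
    if parameter_type is not None:
        # Stage 2: apply only the special keywords for that parameter.
        special = PARAMETER_KEYWORDS[parameter_type]
        value = next((w for w in words if w in special), None)
    return form, parameter_type, value


ticket = identify("a ticket to paris please".split())
clock = identify("what time is it".split())
```

A result such as `("book_ticket", "city", None)` would correspond to the case in which the form is right but the parameter is not, so that the city name alone can be coached in isolation.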

The keywords are selected according to the information they bear, and how well they distinguish the phrases used in each subject area. This means that it is not necessary for the system to recognise every word in order to identify the phrase being spoken. This has the advantage that a number of speech recognition errors can be tolerated before phrase identification is lost. Furthermore, correct phrases can be identified in spite of errors in the wording which might be produced by a novice. It is reasonable to conjecture that, if the system is able to match attempted translations with their corrected versions, such utterances should be intelligible in practice when dealing with native speakers who are aware of the context. This means that the system tends to concentrate training on just those parts of the student's diction which give rise to the greatest ambiguity in the foreign language. This might be due to bad pronunciation of important keywords or simply due to their omission.
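The tolerance to recognition errors can be illustrated with a small scoring sketch. The phrases, keyword sets and the `min_score` threshold are hypothetical, and scoring by the fraction of keywords spotted is an assumed simplification rather than the scheme of the cited application.

```python
# Each phrase is characterised by a few information-bearing keywords
# (illustrative data only).
PHRASE_KEYWORDS = {
    "Where is the station?": {"where", "station"},
    "How much is the room?": {"how", "much", "room"},
}


def identify_phrase(spotted, min_score=0.5):
    """Return the best-matching phrase, or None if identification is lost.

    `spotted` is the set of keywords the recogniser detected;
    `min_score` is an assumed acceptance threshold.
    """
    best_phrase, best_score = None, 0.0
    for phrase, keywords in PHRASE_KEYWORDS.items():
        score = len(keywords & spotted) / len(keywords)
        if score > best_score:
            best_phrase, best_score = phrase, score
    return best_phrase if best_score >= min_score else None


partial = identify_phrase({"much", "room"})  # "how" was missed
lost = identify_phrase({"hello"})            # no keyword matched
```

In this sketch one missed keyword still leaves the phrase identifiable, while an utterance with no matching keywords returns `None`, the point at which the student would be asked to try again.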

The described system therefore provides an automated learning scheme which can rapidly bring language students up to a minimum level of intelligibility, and is especially useful for busy businessmen who simply wish to expedite their transactions, or holiday makers who are not too worried about grammatical accuracy.

The correct pronunciation of phrases is given by the recorded voice of a native speaker, who provides the appropriate intonation and co-articulation between words. The advanced student is encouraged to speak in the same manner, and the system will continue to check each utterance, providing the word spotting technology employed is able to cope with the increasingly fluent speech.

Referring to FIG. 3, in another aspect of the invention, a visual display of the mouth of the native speaker is provided so as to exhibit the articulation of each spoken phrase. This display may conveniently be provided on a CRT display using a set of quantised mouth shapes as disclosed in our previous European Published Application No. 0225729A. A whole facial display may also be used.

In one simple embodiment, the display may be mounted in conjunction with a mirror so that the student may imitate the native speaker.

In a second embodiment, a videophone coding apparatus of the type disclosed in our previous European Published Application No. 0225729 may be employed to generate a corresponding display of the student's mouth so that he can accurately compare his articulation with that of the native speaker. The two displays may be simultaneously replayed by the student, either side by side, or superimposed (in which case different colours may be employed), using a time-warp method to align the displays.
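The text does not specify the time-warp method. One common choice is dynamic time warping, sketched below on the assumption that each display is a sequence of quantised mouth-shape indices and that the cost of pairing two frames is the absolute difference of their indices. Both assumptions, and the name `dtw_path`, are illustrative only.

```python
def dtw_path(a, b, cost=lambda x, y: abs(x - y)):
    """Dynamic time warping: return (total_cost, aligned index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] is the cheapest cost of aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = cost(a[i - 1], b[j - 1]) + min(
                d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]
            )
    # Trace back from the end to recover which frames are paired.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (d[i - 1][j - 1], i - 1, j - 1),
            (d[i - 1][j], i - 1, j),
            (d[i][j - 1], i, j - 1),
        )
    path.reverse()
    return d[n][m], path


native = [0, 1, 2]      # native speaker's mouth-shape sequence
student = [0, 1, 1, 2]  # student holds the second shape for longer
total, path = dtw_path(native, student)
```

Each pair in `path` names a native frame and a student frame to be shown together, so a student who speaks more slowly is still compared shape for shape.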

I claim:
 1. Apparatus for pronunciation training comprising: a speech generator for generating utterances to prompt a trainee; a controller for generating speech templates from utterances by the trainee in response to prompts from the speech generator of words from a predetermined selected set of words; a speech recognizer for recognizing, in a trainee's current utterances, words from the predetermined selected set of words by comparing each of the trainee's current utterances with said speech templates; and an output device indicating improvement in the trainee's pronunciation based on increases in a deviation of the trainee's current utterances from the speech templates.
 2. Apparatus according to claim 1, wherein the controller updates the speech templates from the trainee's current utterances when the said deviation exceeds a predetermined value.
 3. Apparatus according to claim 1, wherein the speech recognizer recognizes in the trainee's response to the prompt one or more words from the set of words, and wherein the speech generator generates an utterance depending on the one or more words recognized by the speech recognizer.
 4. Apparatus for pronunciation training according to claim 3, further comprising: phrase recognition means for identifying phrases by a combination and order of words from the predetermined selected set, wherein when the trainee is prompted by an utterance generated by the speech generator to utter a phrase, the phrase recognition means recognizes the phrase and selects the utterance generated by the speech generator to be a reply to the phrase.
 5. A pronunciation training apparatus according to claim 1, further comprising: a video generator for generating corresponding video images of a mouth to prompt a trainee to imitate a correct pronunciation of the generated utterances.
 6. Apparatus according to claim 5, further comprising: a video analyzer for analyzing mouth movements of the trainee and displaying corresponding synthesized and analyzed mouth movements generated by the video generator and by the trainee.
 7. Language training apparatus according to claim 1, wherein the speech generator generates utterances in a language in the accent of a native speaker of that language.
 8. A method of pronunciation training, comprising: prompting a trainee to pronounce a series of words; generating trainee speech templates from the trainee's pronunciations; prompting a trainee to speak an utterance; analyzing the utterance using the trainee speech templates; and assessing improvements in pronunciation by measuring a difference between the utterance and a corresponding trainee speech template, wherein an increase in difference corresponds to an improvement in pronunciation.
 9. A method according to claim 8 further comprising the step of: updating the speech template when the difference exceeds a predetermined value.
 10. A system for training a speaker's pronunciation, comprising: a speech synthesizer for generating synthesized speech to prompt the speaker to generate speech samples; control circuitry for determining speech pronunciation characteristics from the speaker's initial speech samples generated in response to speech synthesizer prompts; processing subsequent speech samples of the speaker to determine a deviation from the determined speech pronunciation characteristics and indicating progress in the speaker's pronunciation based on an increase in the deviation.
 11. The system according to claim 10, further comprising: means for modifying the derived pronunciation characteristics when the deviation exceeds a predetermined threshold.
 12. The system according to claim 10, wherein said speech synthesizer prompts speech from the speaker in a foreign language and the means for receiving receives speech samples from the speaker in the foreign language.
 13. The system according to claim 10, wherein the speech pronunciation characteristics include key words and phrases.
 14. The system according to claim 10, wherein the speech synthesizer prompts the speaker using words in a native language of the speaker and the means for processing processes subsequent responsive speech samples received in a foreign language corresponding to translations of the native language words.
 15. The system according to claim 10, wherein the speech synthesizer prompts the speaker using words in a foreign language and the control circuitry processes subsequent speech samples received from the speaker in a foreign language.
 16. The system according to claim 10, wherein the speech synthesizer generates words and phrases using the accent of a native speaker of a foreign language.
 17. The system according to claim 10, further comprising: a video generator for generating video images of a human mouth configured to correspond with proper pronunciation of the synthesized speech samples to assist the speaker in a response.
 18. A method of pronunciation training comprising: prompting a trainee using words generated by a speech generator in a secondary language different than a primary language of the trainee as pronounced by a native speaker of said secondary language, the trainee pronouncing speech in the secondary language in response to said prompting; deriving trainee speech templates from the speech in the secondary language pronounced by said trainee; prompting dialog responses from said trainee in said secondary language via said speech generator; comparing said dialog responses to corresponding ones of said trainee speech templates; and indicating progress in pronunciation as differences between said dialog responses and said corresponding trainee speech templates increase.
 19. The method according to claim 18, further comprising: updating said trainee templates when a difference between said dialog responses and said trainee templates exceeds a threshold.
 20. The method according to claim 18, wherein said dialog generating step includes: generating a video image of said native speaker's mouth shape synchronized with said generated secondary dialog speech.
 21. The method according to claim 18, further comprising: analyzing shapes of the mouth of said trainee during said dialog responses.